Autonomous Systems - Deep Learning

Project Task 2 - Face Recognition with Transfer Learning.

Create a neural network to recognize your face and clearly discern it from other faces and objects!

Markus Ullenbruch

This notebook was implemented and executed on the Windows 10 operating system. Due to the corona pandemic, I was not able to test the code on a Linux operating system at university.

0. Introduction

The scope of the project work of the lecture "Autonomous Systems - Deep Learning" is to implement software to recognize my own face and clearly discern it from other human faces and objects.
This task must be addressed and solved using deep learning with transfer learning. The framework used is TensorFlow as the backend, together with the high-level API Keras, in the Python programming language.

A special characteristic of deep learning for facial recognition is that we do not "hardcode" mathematical rules and algorithms to recognize the classes. Deep learning learns from experience: the model looks at labeled training data, makes predictions on it, compares these predictions with the ground-truth labels and then optimizes an error loss function based on the predictions. Optimizing the error loss function means searching for its global minimum.

To solve this task, a large amount of data is needed in order to ensure good performance of the deep learning model. Based on the training data a deep learning model is trained on, the model should generalize from that training data to unseen data, so that it can make predictions on this new and unseen data. This is very important because in real-life applications you need to make predictions on new data that the model was not trained on! Real-life applications include unlocking the smartphone screen when a specific face looks at it, or granting access to rooms in companies via face recognition.

In order to train the model properly and achieve good performance, the quality of the data is of very high importance. A dataset will be presented in this project where 40% of the data was created by myself and the other 60% was downloaded from research and educational datasets provided on the internet from different sources. The dataset which is uploaded with this notebook contains prepared data on which training can be done. The deep learning network will be created based on the knowledge gained during the lecture from Prof. Dr.-Ing. Stache.

First the used dataset is described and the data preparation is done, then the neural network is trained based on this prepared dataset, and in the last steps an evaluation of the resulting model and a discussion of the results is presented.
Before the discussion of the results, a second model, which is designed to work efficiently on mobile devices, is also trained and shortly evaluated.

1. Load modules and packages

First, all the needed modules and packages are imported to provide all the functionality and functions we need to solve the task. For example, many Keras functionalities are imported in order to set up the deep learning model structure and to train it. The os module is imported to handle relative and absolute paths safely on different operating systems and to keep the implementation executable everywhere. Other modules like PIL are imported to perform image manipulations, numpy for scientific calculations and matplotlib to plot graphs and results.

In a next step, it is checked whether a GPU (Graphics Processing Unit) is available to speed up the training process. With a GPU the network can be trained a lot faster, which is more comfortable when fine-tuning the hyperparameters based on the training and validation datasets. A GPU was used to fine-tune the model at a friend's computer (Nvidia GeForce GTX 1080 Ti), since I don't own a GPU.
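Such a check can be done as follows (a minimal sketch using the TensorFlow 2 API; older TF 1.x versions use different call names):

```python
import tensorflow as tf

# List all GPUs visible to TensorFlow; an empty list means training runs on the CPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print(f"{len(gpus)} GPU(s) available:", [gpu.name for gpu in gpus])
else:
    print("No GPU found - training will run on the CPU.")
```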

2. The Raw Dataset

Getting and collecting data is the most crucial and important part when it comes to making predictions with a deep neural network. Getting, creating and preparing the data is an often underestimated and very time-consuming part of data science projects.
Examples of the dataset will be shown later in this notebook.
For the classification task, three classes were used. In the following the used dataset, sources and classes are described:

2.1 Class "Markus"

One of the tasks of this project is to collect and create my own data. To get the data of my own face, the data samples of the category "Markus" were created by myself.
Images were taken with an iPhone 8 with its 7-megapixel front camera and saved in JPEG file format. Selfies were taken from slightly varying distances, but it was important to try to match the positions of the faces of the other humans in the dataset "Others", so that the relevant features are learnt from the faces and not, for example, from the camera position or the background.
The images are square, and the whole face including the hair and a little bit of the t-shirt is seen in the photo. A little background and my clothes are visible in the images. Images were taken with different t-shirts/pullovers and in front of different backgrounds.
Sometimes I took a series of 50 photos within a few seconds with the photo-series function of the iPhone, slightly varying the angle or distance of the camera with my arm while taking the images.
Examples of the images are shown later in this notebook.

2.2 Class "Others"

The photos of random human faces were downloaded from the data-science website Kaggle, which offers the images for free, for non-commercial research and educational purposes only.
The dataset is the Flickr-Faces-HQ Dataset (FFHQ) downloaded from https://www.kaggle.com/arnaud58/flickrfaceshq-dataset-ffhq/data

This dataset consists of 52,000 PNG images at a square resolution of 512×512 and contains people who differ in age, ethnicity, nationality, hairstyle, clothes, facial expression and image background. The images were originally web-scraped from Flickr and were automatically aligned and cropped by the creators using dlib.

The dataset was originally created for:

A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras (NVIDIA), Samuli Laine (NVIDIA), Timo Aila (NVIDIA)
https://arxiv.org/abs/1812.04948

2.3 Class "Objects"

A dataset of objects was downloaded, containing objects like ships, cars, aircraft, bags, pencils, machines etc. It contains 1,854 different object concepts and more than 26,000 images.
This dataset is the "THINGS object concept and object image database" downloaded from: https://osf.io/jum2f/

The dataset is created by:

THINGS: A database of 1,854 object concepts and more than 26,000 naturalistic object images.
Martin N. Hebart, Adam H. Dickter, Alexis Kidder, Wan Y. Kwok, Anna Corriveau, Caitlin Van Wicklin & Chris I. Baker
Laboratory of Brain and Cognition, National Institute of Mental Health, National Institutes of Health, Bethesda MD, USA

Hint: Examples of the preprocessed data are shown in Chapter "4. Examples of the data", and more specific information on the (preprocessed) image dataset is given in Chapter "3. Data Preparation".

3. Data Preparation

3.1 What is done?

First, the folder structure of the processed and prepared training, validation and test images of the classes is checked to see whether the folders are empty or contain processed data. If one folder is empty, the data preparation of that specific class is executed; if processed data is available for that specific class in every training, validation and test folder, nothing is done.

When data preparation is executed, e.g. for the class "Markus", all images of the class "Markus" in the training, validation and test folders are deleted first.
The raw photos (JPEG or PNG files) are then cropped to have the same view of the faces in both datasets "Markus" and "Others". After that they are resized to match the input size of the neural network (240x240), which also decreases the memory size of the images in order to upload them to ILIAS.
Splitting the processed data into training, validation and test datasets and saving them in the corresponding folder structure is also done in the preparation process.
The target file format of the preprocessed data is JPEG, and the data is uploaded to ILIAS along with this Jupyter notebook.
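The cropping, resizing and splitting steps described above can be sketched roughly as follows (a simplified illustration using PIL; the center-crop logic and the split ratios are assumptions for illustration, only the 240x240 target size comes from the text):

```python
import random
from PIL import Image

def crop_and_resize(img, size=(240, 240)):
    """Center-crop an image to a square and resize it to the network input size."""
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize(size, Image.LANCZOS)

def split_dataset(filenames, train=0.8, valid=0.1, seed=42):
    """Shuffle the file list and split it into training/validation/test parts."""
    files = list(filenames)
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train)
    n_valid = int(len(files) * valid)
    return (files[:n_train],
            files[n_train:n_train + n_valid],
            files[n_train + n_valid:])
```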

3.2 Folder Structure of the dataset

The folders which contain the processed training, validation and test datasets are set up in the following folder structure:

folders.PNG

This specific folder structure is important so that we can feed the data to our model during the training process with Keras.

Markus Dataset

Others Dataset

Objects Dataset

3.3 Dataset Sample Information

The following chart contains information about the sample size of the dataset in absolute and relative numbers of image samples and information about the content of the dataset.

data.PNG

3.4 Define training, validation and test path for Keras
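Such paths can be built portably with the os module, for example (the concrete folder names are assumptions based on the structure shown above):

```python
import os

# Base directory of the prepared dataset (assumed name).
base_dir = "dataset"

# Build OS-independent paths for the three splits with os.path.join,
# so the notebook runs on Windows and Linux alike.
train_path = os.path.join(base_dir, "training")
valid_path = os.path.join(base_dir, "validation")
test_path = os.path.join(base_dir, "test")

print(train_path, valid_path, test_path)
```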

4. Examples of the data

In the following section, examples of the preprocessed dataset from all classes are shown.

4.1 Self created data - "Markus"

This is the data I created on my own with a smartphone front camera (iPhone 8). Data is shown from all three datasets: training, validation and test.

4.2 The Dataset - "Others"

This is the dataset of other human faces from the Flickr-Faces-HQ Dataset (FFHQ), downloaded from https://www.kaggle.com/arnaud58/flickrfaceshq-dataset-ffhq/data. Only random images from the training dataset are shown.

4.3. The Dataset - "Objects"

"THINGS object concept and object image database", downloaded from: https://osf.io/jum2f/. Only random images from the training dataset are shown.

5. Network Training

5.1 The Deep Neural Network Model - Transfer Learning

Since transfer learning is used in this task, an already pretrained network will be used here. The following image from MathWorks describes the pipeline of the transfer learning process applied in this notebook: transferlearningworkflow.png source of image: https://de.mathworks.com/help/deeplearning/ug/train-deep-learning-network-to-classify-new-images.html

First, the pretrained network is loaded and the top layers (fully connected layers), which are the classifying layers based on the extracted features from the layers before, are replaced by our own layers.

Then the network is trained, but only the weights of our own fully connected top layers are changed. The weights of the base model, which extracts the features with convolutional layers, will not be trained.

In the next step, the accuracy, precision and recall of the resulting model will be evaluated based on the training, validation and test dataset.

In the last step the resulting model and its results are plotted and visualized. The results of the model, like accuracy of the validation dataset, can be used in a feedback loop back to the training process to finetune the hyperparameters.

5.2 Load the network - InceptionResNetV2

For transfer learning, we need to load a pre-built and pre-trained deep neural network architecture. I decided to use the InceptionResNetV2 model. It was trained on the ImageNet dataset, whose weights are loaded in this notebook. The input size is defined as the image size chosen above (240, 240) plus the three color channels of the RGB color space.

The top layers of the network are deliberately not loaded, using include_top=False. This step is crucial for transfer learning, since we only want to train the layers we put on top of the model ourselves. The network should then learn to classify my three classes based on the feature extraction of the pretrained base model.

In the following, the InceptionResNetV2 model is defined without top layers as the base_model. On top of this base model, we can stack as many layers as we want!
The weights of the base model are frozen, which means they will not be trained during the training process! Only the weights of the custom top layers we add will be trained, learning to do the classification task.
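Loading and freezing the base model can be sketched like this (the notebook uses weights="imagenet"; weights=None is shown here only to keep the sketch runnable without the large download):

```python
from tensorflow.keras.applications import InceptionResNetV2

IMG_SIZE = (240, 240)  # image size chosen above

# Load the network without its fully connected top layers.
base_model = InceptionResNetV2(
    include_top=False,
    weights=None,  # the notebook uses weights="imagenet"
    input_shape=IMG_SIZE + (3,),  # three RGB color channels
)

# Freeze all weights of the base model so only our own top layers are trained.
base_model.trainable = False
```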

5.3 Add layers to base model

Generate a Keras Model, double check the final structure and compile it.

The output of the convolutional layers of the base model will be flattened, meaning reshaping the output tensor to a 1D vector. This 1D vector will be the input to the fully connected layer to classify the images.

With the Keras "Dense" function, a fully connected prediction layer is created with 3 neurons (because we have 3 classes to predict). The flattened 1D layer is fully connected to this prediction layer.
The following image shows the structure of the classification layer and its integration to the base model:

struktur.png

Overfitting is a phenomenon in machine learning that occurs when a model learns the training data by heart, so that the model is specifically tailored to the training dataset and cannot generalize to other datasets that are unseen by the network. With the layers defined above, no overfitting occurred.
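Putting the pieces together, the head described above can be sketched like this (again with weights=None instead of the imagenet weights to keep the sketch light; the optimizer and loss settings are assumptions, not necessarily the notebook's exact choices):

```python
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# Frozen feature extractor (the notebook loads weights="imagenet").
base_model = InceptionResNetV2(include_top=False, weights=None,
                               input_shape=(240, 240, 3))
base_model.trainable = False

# Flatten the convolutional output to a 1D vector and attach the
# 3-neuron prediction layer (one neuron per class) with softmax.
x = Flatten()(base_model.output)
predictions = Dense(3, activation="softmax")(x)

model = Model(inputs=base_model.input, outputs=predictions)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```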

5.4 Define Hyperparameters

Hyperparameters like the number of epochs and the batch size are defined in the following. The number of epochs is the number of times the model is trained on the whole training dataset. The weights of the model are not updated only once after all the training data has been processed; instead they are updated many times per epoch. This is controlled by the batch size: the batch size is the number of images the model processes at once. The gradient of the error function is calculated over all images in the batch, and the weights are updated after this batch is processed, before the next batch of images is processed.
Hyperparameters have significant impact on the results and performances of neural networks.

5.5 ImageDataGenerator

An ImageDataGenerator generates batches of tensor image data with real-time data augmentation during training. The model is trained on augmented data, which makes it more robust, since the training data is manipulated in order to be more difficult to predict. When the training process is set up on those more difficult images, the model will perform better on unmanipulated data.
Another benefit of this method is that you can generate more training data when only few images are available for the training process, or when many images are nearly the same because they were taken with the photo-series function of a camera.
Data augmentation produces more diverse images when the images in a batch are similar, which can be the case for images of the class "Markus", since I took many photos in a very short time with the camera's photo-series function.

We need data augmentation on the training data only, because we want to generate more diverse data for training and to make the network robust. Validation and testing should be done with non-augmented data!

Create 3 separate generators for the training, validation and test datasets:
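A sketch of the three generators (the concrete augmentation parameter values are illustrative assumptions; only the training generator augments, and preprocess_input belongs to the chosen base model):

```python
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation only for training; the exact parameter values are assumptions.
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

# Validation and test data must stay unaugmented.
valid_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

# The generators are then fed from the folder structure, e.g.:
# train_gen = train_datagen.flow_from_directory("dataset/training",
#     target_size=(240, 240), batch_size=32, class_mode="categorical")
```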

5.6 Define Callbacks

An "Early Stopping" callback is defined to stop training after a specified number of epochs if no improvement in the monitored value occurs during those last epochs.
The best model weights in the sense of the monitored value are restored and chosen as the weights of the trained network. The value to monitor is the validation accuracy.
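A sketch of such a callback (the patience of 20 epochs matches the value mentioned later in this notebook):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training if validation accuracy does not improve for `patience` epochs
# and restore the best weights seen so far.
early_stopping = EarlyStopping(
    monitor="val_accuracy",
    patience=20,
    restore_best_weights=True,
    verbose=1,
)

# Usage: model.fit(..., callbacks=[early_stopping])
```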

5.7 Start Training

6. Evaluation

The best performing model during training is used to evaluate the dataset.

6.1 Loss and Accuracy of training, validation and test dataset

Evaluate and print the losses and accuracies of the three datasets:

The accuracy values on the unseen data of the validation and test datasets are 98.5% and 98.6%, which is a very good result since most images are predicted correctly!

6.2 Plot of Loss & Validation accuracy versus epochs during training

6.3 Precision & Recall

In order to find out which classes interfere with each other, a confusion matrix will be created.

Metrics other than accuracy, like precision and recall, will be calculated with the sklearn.metrics package. Based on a 2-class confusion matrix (image below), the true positives, true negatives, false positives and false negatives can be counted, and different metrics can be evaluated based on them. There are many more metrics to evaluate a model, as you can see in the following image:

Unbenannt.PNG

Source of Image: https://en.wikipedia.org/wiki/Precision_and_recall

We will focus on accuracy, precision and recall.

The accuracy (ACC) is defined as the sum of all true positives plus the sum of all true negatives divided by the sum of the total population (total samples of images):
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$
It answers the question: "What proportion of all predictions on the images were actually predicted correct?"
But sometimes, especially on imbalanced datasets, the evaluation of only accuracy as metric can be misleading and dangerous because accuracy alone doesn't give a good impression of the performance of the model. Because of this fact, we need to evaluate other metrics like precision and recall to be sure to have a good model.

The Recall (also known as Sensitivity or true positive rate (TPR)) is defined as:
$\text{Recall} = \frac{TP}{TP + FN}$
Recall answers the question "What proportion of actually positive conditions was identified correctly?"
Or in terms of facial recognition: what proportion of actually "Markus" images was identified correctly as "Markus"?

The precision (also known as positive predictive value (PPV)) is defined as:
$\text{Precision} = \frac{TP}{TP + FP}$
Precision is answering the question "What proportion of positive identifications was actually correct?"
Or: What proportion of all identifications as "Markus" was actually an image of "Markus"?
(Definition of recall and precision source: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall)

Each class has its own precision and recall value. For example, there is the positive condition "Markus", which includes all samples of the class "Markus", and the negative condition "not Markus", which includes the remaining classes "Objects" and "Others". The same holds for the other two classes "Others" and "Objects". Based on that, three 2x2 confusion matrices can be derived, and a precision and recall value can be calculated for each class.
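The per-class computation can be sketched with the sklearn.metrics package (the labels and predictions below are a tiny made-up example, not the notebook's actual results):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

classes = ["Markus", "Objects", "Others"]

# Tiny illustrative example, not the real evaluation data.
y_true = ["Markus", "Markus", "Others", "Others", "Objects", "Objects"]
y_pred = ["Markus", "Others", "Others", "Markus", "Objects", "Objects"]

# One precision and recall value per class (average=None).
precision = precision_score(y_true, y_pred, labels=classes, average=None)
recall = recall_score(y_true, y_pred, labels=classes, average=None)
cm = confusion_matrix(y_true, y_pred, labels=classes)

for name, p, r in zip(classes, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```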

Accuracy

The accuracy of the model on unseen data (validation and test data) is greater than 98.5% in both the validation and test dataset, which is a very good result.

Precision

In the validation dataset the class "Markus" has the lowest precision with 96.9% and the class "Objects" has the highest with 100%.
In the test dataset the class "Markus" has the lowest precision with 96.6% and the classes "Objects" and "Others" have the highest with 100%.

Recall

In the validation dataset the class "Others" has the lowest recall with 96.5% and the class "Markus" has the highest with 99.6%.
In the test dataset the class "Others" has the lowest recall with 96.5% and the class "Markus" has the highest with 100%.

6.4 Plot confusion matrix

The confusion matrix is plotted with the code from the Fruit Examples notebook of Prof. Dr.-Ing. Stache.

In the training dataset the network predicted 30 images of "Others" falsely as "Markus", 5 images of "Objects" falsely as "Markus" and 2 images of "Markus" falsely as "Others". All of the other 3092 training images were correctly classified.

Confusion matrix on validation data

In the validation dataset the network predicted 7 images of "Others" falsely as "Markus", 1 image of "Objects" falsely as "Markus" and 1 image of "Markus" falsely as "Others". All of the other 606 validation images were correctly classified.

Confusion matrix on test data

In the test dataset the network predicted 4 images of "Others" falsely as "Markus" and 1 image of "Objects" falsely as "Markus". All of the other 344 test images were correctly classified.

7. Application on mobile devices or embedded systems - MobileNetV2

Since the deep learning network "InceptionResNetV2" has a memory size of 215 MB and a depth of 572 layers, it is not very efficient for use on mobile devices in real-time apps or on embedded systems where computational power is limited.
Because of that, researchers from Google released a neural network architecture, the "MobileNetV2", that is optimized for use on mobile devices and embedded vision applications. It has a memory size of only 14 MB and a depth of 88 layers.

This network achieves high accuracy on image classification tasks while keeping the number of parameters and mathematical operations as low as possible, so that it works fast and efficiently on mobile devices.
Neural networks need to work efficiently these days because new mobile applications entering the market allow users to interact with the real world in real time. The MobileNetV2 network is an improved version of the MobileNetV1 in terms of:

7.1 Training the MobileNetV2

In the following code sections, a neural network is trained with transfer learning based on the MobileNetV2 from Google. The training pipeline is the same as for the InceptionResNetV2 in the code cells above. Data augmentation is used for the training dataset, too. The Keras ImageDataGenerators have to be created again, because we need to pass the preprocess_input function of the MobileNetV2 to meet the preprocessing needs of the new network.

The MobileNetV2 is loaded without top layers and with all weights of the base model frozen. The output of the MobileNetV2 base model is flattened and then connected to a fully connected Dense layer with 200 neurons and a relu activation function. This layer is then connected to another Dense layer with 3 neurons and a softmax activation function (prediction layer). The top layer structure is shown in the following image:

struktur_mobile.png

The training is stopped with an Early-Stopping callback after 20 epochs of no improvement in validation accuracy. Because the training process is so similar to that of the first deep learning model, the code below for the MobileNetV2 is compressed into a few code cells.
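The described MobileNetV2 model can be sketched as follows (weights=None keeps the sketch runnable without the download; the notebook uses the imagenet weights, and the compile settings are assumptions):

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# weights=None keeps the sketch light; the notebook uses weights="imagenet".
base_model = MobileNetV2(include_top=False, weights=None,
                         input_shape=(240, 240, 3))
base_model.trainable = False  # freeze the MobileNetV2 feature extractor

# Head as described: Flatten -> Dense(200, relu) -> Dense(3, softmax).
x = Flatten()(base_model.output)
x = Dense(200, activation="relu")(x)
predictions = Dense(3, activation="softmax")(x)

model = Model(inputs=base_model.input, outputs=predictions)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```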

Data Generators

Define MobileNetV2 transfer learning model

Evaluation of MobileNetV2

7.2 Single Predictions on challenging Images (Me vs. my brother)

In this section, single predictions are made with the trained "MobileNetV2" network on images that are subjectively interpreted as "difficult", e.g. an image of me wearing a face mask and images of my brother, who looks a lot like me.
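A single prediction on one image file can be sketched like this (the image path in the usage comment is a placeholder; the helper assumes the MobileNetV2 preprocessing used above and the alphabetical class order Keras derives from the folder names):

```python
import numpy as np
from PIL import Image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Assumed class order: alphabetical, as flow_from_directory assigns it.
CLASS_NAMES = ["Markus", "Objects", "Others"]

def predict_single(model, img_path, img_size=(240, 240)):
    """Load one image, preprocess it and return the predicted class name."""
    img = Image.open(img_path).convert("RGB").resize(img_size)
    x = np.asarray(img, dtype="float32")
    x = preprocess_input(x)        # scale pixels as MobileNetV2 expects
    x = np.expand_dims(x, axis=0)  # add the batch dimension
    probabilities = model.predict(x)[0]
    return CLASS_NAMES[int(np.argmax(probabilities))]

# Usage (placeholder path):
# print(predict_single(model, "challenge_images/brother_01.jpg"))
```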

The neural network is challenged when it comes to predictions on images of my brother, who looks a lot like me. As one can see above, the network sometimes predicts images of my brother as "Markus" and sometimes as "Others".
Images of my brother are not in the training dataset, so the network was not trained on images of him.
My brother looks a lot more like me than most of the people in my training set, despite the fact that many young men are in the training dataset. I expected exactly this: that the network would have trouble identifying my brother correctly as "Others".

But the results are pretty good: There are 17 single images of my brother; 13 were correctly identified as "Others" (true positives) and 4 were incorrectly identified as "Markus" (false negatives).

Of the 3 images of "Markus" wearing a face mask, 2 were correctly predicted as "Markus" and 1 was incorrectly predicted as "Others". I expected the image where the mask is pulled down onto my chin and the whole face can be seen to be predicted correctly, but this image was classified falsely as "Others"!

8. Discussion of the results

On the first try, a neural network was trained based on the "Xception" model, and the results were not that satisfying, with validation accuracy hovering around 89%. Training the Xception model is not shown in this notebook. I tried a few more models, and finally the "InceptionResNetV2" model was the best for my task in terms of accuracy, precision and recall, with values around 97-99% in all three metrics!
As it turned out, the choice of model for transfer learning has a major impact on the performance of the trained neural network.

The results of my neural network achieved with transfer learning based on the "InceptionResNetV2" model are very good and satisfying.
Accuracy, precision and recall are very good in both the validation and test datasets, which I did not expect before first training and evaluating the model. The result exceeded my expectations.

The performance of the "MobileNetV2" in terms of validation and test accuracy was even better!

InceptionResNetV2 vs MobileNetV2

The validation and test accuracies of the "InceptionResNetV2" are 98.5% and 98.6%, and for the "MobileNetV2" they are 98.4% and 99.1%.
I did not expect the MobileNetV2 to perform this well, because it has far fewer layers and fewer mathematical operations than the InceptionResNetV2.

When high performance on accuracy, precision and recall is very important and of high priority, the InceptionResNetV2 is a very good choice for my facial recognition application. This model could be used when house doors or doors in companies are opened based on a facial recognition application, or in medical applications where false negatives can be fatal (for example: prediction: no melanoma, ground truth: melanoma).
When I want my application to be very reliable and trustworthy in its predictions and classifications, I would suggest using the "InceptionResNetV2" as the base model for transfer learning. But the MobileNetV2 performs similarly in terms of accuracy, so both models are a very good choice for this task.

But when computational power is limited, for example if we want to implement our application on an embedded system or on a mobile device, an efficient network model is needed which uses fewer mathematical operations and which is fast. In this case the MobileNetV2 network is the best choice, and with only 14 MB model size it can be downloaded and deployed easily and quickly on mobile devices or embedded systems.
When the high priority is a fast and efficient model with very few operational and resource needs, I would suggest using the "MobileNetV2" for transfer learning.

Both models are very well suited for my classification task.

Overfitting/ Underfitting

Overfitting occurred when there was an additional fully connected layer before the prediction layer in the InceptionResNetV2. The number of trainable parameters was too high then, and so the model learnt the training data by heart. The training accuracy was very high, but the validation accuracy was bad.

On the MobileNetV2, an additional fully connected layer was needed after the flattened output of the base_model and before the prediction layer in order to get the right capacity and avoid underfitting.

Possible solutions to classify my brother correctly as "Others"

To achieve even better performance when it comes to discerning my face from people who look a lot like me, such as my brother, there are several possibilities:

Training with GPU

At first, the training of the models was done on my home computer, which does not have a Nvidia GPU. Training was very slow, and training the whole notebook would have taken more than 24 hours. So my Early-Stopping patience parameter was set to a low value at first (4-6), so that training took around 14-16 hours.
After that, I was able to train my notebook on a friend's computer with a Nvidia GPU, a Nvidia GeForce GTX 1080 Ti. Training both models with an early-stopping patience of 20 took approximately 1 hour 45 minutes, and parameters could be tuned and different settings could be tested. So training on a Nvidia GPU is highly recommended for tasks like this.

Working on this project was very interesting and fun, and I learned a lot! I gained much interest in the world of machine learning and deep learning, and I want to continue to focus on those topics in the future.